Introduction

For the first analysis module, our goal will be to predict a quantitative variable. Since the dataset is better suited to classification, only a single quantitative analysis will be performed.

To start the analysis, let’s import the required dataset; given the model’s needs, the aggregated dataset will be used.

# Load dataset from cache
if (file.exists("Cache/traffic_hourly.rds") && file.exists("Cache/sf_gatesGPS.rds")) {
  sf_gatesGPS <- readRDS("Cache/sf_gatesGPS.rds")
  df_hourly <- readRDS("Cache/traffic_hourly.rds")
  
  # Change id_varco to its real location
  df_hourly <- df_hourly %>%
    left_join(sf_gatesGPS %>% 
                st_drop_geometry() %>% 
                dplyr::select(id_amat, label),
                by = c("id_varco" = "id_amat"))
  df_hourly$id_varco <- NULL
  df_hourly$label <- as.factor(df_hourly$label)

  message("Data loaded successfully.")
} else {
  stop("Run step 01 first.")
}
## Data loaded successfully.

Since we’re starting to predict, we’ll need some data to verify whether the model also works on data not seen during the training phase. To do so, we split the dataset in two, assigning 80% to the training phase and reserving 20% for testing.

# Split Train/Test (80/20)
set.seed(123)
train_index <- sample(1:nrow(df_hourly), 0.8 * nrow(df_hourly))
train_data <- df_hourly[train_index, ]
test_data <- df_hourly[-train_index, ]

rm(df_hourly, train_index)

# Remove NA (there are just 23)
train_data <- na.omit(train_data)
test_data <- na.omit(test_data)

Traffic Volume Prediction

What leads to more traffic?

The question we’ll try to answer concerns traffic volume prediction. We’ll start from a basic Multiple Linear Regression with the following predictors:

  • hour, since it will probably be the most influential predictor
  • day_of_week, since the EDA showed how traffic differs between weekdays and weekends
  • month, to capture holidays and the seasonal effect

As outcome, we’ll use total_transits.

Method 1: Multiple Linear Regression

This model will surely perform poorly, since we’re treating the phenomenon as linear even though the EDA showed that traffic follows an “M” pattern. This model will simply act as a baseline against which to measure improvements. Based on the chosen predictors, we’ll compute: \[TotalTransits = \beta_0 + \beta_1(Hour) + \beta_2(DayOfWeek) + \beta_3(Month) + \epsilon\]

[!NOTE] To avoid the intercept representing the first gate on a Monday in January, we’ll switch to sum coding, where the intercept represents the global traffic mean. Doing so allows us to interpret each predictor’s effect as its deviation from the mean. From a prediction standpoint this change does nothing, but from an inference standpoint it helps us answer the research question better.

# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))

# Model
linear_model <- lm(total_transits ~ hour + day_of_week + month, data = train_data)
summary(linear_model)
## 
## Call:
## lm(formula = total_transits ~ hour + day_of_week + month, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -142.70  -79.90  -38.18   40.30  816.96 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    74.28634    1.67256  44.415  < 2e-16 ***
## hour            2.03760    0.03254  62.615  < 2e-16 ***
## day_of_week.L -11.58074    0.59806 -19.364  < 2e-16 ***
## day_of_week.Q -21.52406    0.59783 -36.003  < 2e-16 ***
## day_of_week.C  -5.32076    0.59621  -8.924  < 2e-16 ***
## day_of_week^4  -2.76670    0.59602  -4.642 3.45e-06 ***
## day_of_week^5   0.47916    0.59471   0.806 0.420415    
## day_of_week^6   0.03164    0.59582   0.053 0.957646    
## month.L       -33.24295    8.90958  -3.731 0.000191 ***
## month.Q       -20.40645    9.71535  -2.100 0.035692 *  
## month.C       -15.33353    8.90941  -1.721 0.085243 .  
## month^4       -44.87842    7.15729  -6.270 3.61e-10 ***
## month^5       -36.20081    5.10555  -7.090 1.34e-12 ***
## month^6       -10.02906    3.25804  -3.078 0.002082 ** 
## month^7        19.47489    1.89976  10.251  < 2e-16 ***
## month^8         6.70016    1.11666   6.000 1.97e-09 ***
## month^9       -13.27106    0.81866 -16.211  < 2e-16 ***
## month^10      -22.06306    0.75342 -29.284  < 2e-16 ***
## month^11      -18.63772    0.74640 -24.970  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114.2 on 257222 degrees of freedom
## Multiple R-squared:  0.04044,    Adjusted R-squared:  0.04037 
## F-statistic: 602.2 on 18 and 257222 DF,  p-value: < 2.2e-16

Unfortunately, we are not able to see, for example, how every day compares with the intercept, since the last level (n-1) is implicit. By default, R “hides” the last item. To be sure, we can print the levels and identify the last one (Sunday and December will be the hidden levels).

levels(train_data$day_of_week)
## [1] "Mon" "Tue" "Wed" "Thu" "Fri" "Sat" "Sun"
levels(train_data$month)
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
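
Because we switched to sum coding, the hidden last level of an unordered factor can always be recovered: its deviation from the grand mean is minus the sum of the displayed coefficients. (Note that in the summaries below day_of_week and month appear as ordered factors, so they receive polynomial contrasts; the sum-coding recovery applies to unordered factors such as label.) A minimal toy sketch with invented data, not the traffic dataset:

```r
# Toy example with made-up data: under contr.sum, the omitted last level's
# effect equals minus the sum of the displayed level effects.
options(contrasts = c("contr.sum", "contr.poly"))
set.seed(1)
f <- factor(rep(c("Mon", "Tue", "Sun"), each = 5),
            levels = c("Mon", "Tue", "Sun"))
y <- rnorm(15, mean = c(10, 20, 5)[as.integer(f)])
m <- lm(y ~ f)
coef(m)            # intercept = grand mean; f1, f2 = Mon and Tue deviations
-sum(coef(m)[-1])  # recovers the hidden "Sun" deviation
```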

Brief comment

As expected, the model performed really poorly:

Looking at the adjusted R-squared, the multiple linear regression explains only 4% of the traffic variability; the remaining 96% is noise or lost information.

The Residual Standard Error confirms that the model cannot predict the traffic correctly, with an error of 114 vehicles against the 74-vehicle mean captured by the intercept. It’s basically useless.

However, almost all the p-values are extremely low; time is surely influencing the traffic.

The residuals and the negative median (-38) indicate that the model strongly underestimates the traffic and fails to detect the peaks. To be fair, we were aware of the non-linearity since the EDA clearly showed an “M” shape that a model this simple just cannot capture.

Method 2: Polynomial Regression

We really need to account for the non-linear nature of the data. Here, we’ll allow the line to bend by using Polynomial Regression:

\[TotalTransits = \beta_0 + \beta_1(Hour) + \beta_2(Hour^2) + \beta_3(Hour^3) + \beta_4(Hour^4) + \beta_5(DayOfWeek) + \beta_6(Month) + \epsilon\]

The quadratic term allows a parabolic shape (one rise and one fall), while the higher-order terms allow additional bends (one peak in the morning, one in the afternoon).
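One detail worth noting: poly() does not use raw powers but orthogonal polynomial columns, which keeps the four hour terms uncorrelated with each other and the fit numerically stable. A quick sketch:

```r
# poly() returns orthonormal columns: their cross products form the identity
# matrix, so the linear, quadratic, cubic and quartic terms don't compete.
P <- poly(0:23, degree = 4)
round(crossprod(P), 10)
```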

# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))

poly_model <- lm(total_transits ~ poly(hour, 4) + day_of_week + month, data = train_data)

summary(poly_model)
## 
## Call:
## lm(formula = total_transits ~ poly(hour, 4) + day_of_week + month, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -170.79  -67.65  -22.60   37.69  789.65 
## 
## Coefficients:
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     1.043e+02  1.497e+00   69.635  < 2e-16 ***
## poly(hour, 4)1  7.144e+03  1.053e+02   67.835  < 2e-16 ***
## poly(hour, 4)2 -2.240e+04  1.053e+02 -212.636  < 2e-16 ***
## poly(hour, 4)3  1.140e+03  1.053e+02   10.825  < 2e-16 ***
## poly(hour, 4)4  1.058e+03  1.053e+02   10.041  < 2e-16 ***
## day_of_week.L  -1.183e+01  5.514e-01  -21.455  < 2e-16 ***
## day_of_week.Q  -2.161e+01  5.511e-01  -39.210  < 2e-16 ***
## day_of_week.C  -5.252e+00  5.496e-01   -9.554  < 2e-16 ***
## day_of_week^4  -2.823e+00  5.495e-01   -5.138 2.77e-07 ***
## day_of_week^5   3.181e-01  5.483e-01    0.580 0.561752    
## day_of_week^6  -2.713e-02  5.493e-01   -0.049 0.960610    
## month.L         2.861e+00  8.220e+00    0.348 0.727777    
## month.Q         1.890e+01  8.964e+00    2.108 0.034996 *  
## month.C         2.091e+01  8.220e+00    2.544 0.010961 *  
## month^4        -1.576e+01  6.604e+00   -2.387 0.016984 *  
## month^5        -1.575e+01  4.711e+00   -3.345 0.000824 ***
## month^6         2.766e+00  3.006e+00    0.920 0.357401    
## month^7         2.636e+01  1.753e+00   15.043  < 2e-16 ***
## month^8         1.016e+01  1.030e+00    9.864  < 2e-16 ***
## month^9        -1.207e+01  7.548e-01  -15.989  < 2e-16 ***
## month^10       -2.155e+01  6.946e-01  -31.030  < 2e-16 ***
## month^11       -1.841e+01  6.881e-01  -26.758  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105.3 on 257219 degrees of freedom
## Multiple R-squared:  0.1845, Adjusted R-squared:  0.1844 
## F-statistic:  2771 on 21 and 257219 DF,  p-value: < 2.2e-16

Brief comment

Introducing the flexibility offered by the polynomial terms, we can see a small increase in the adjusted R-squared, which is now 18%. It’s not good, but it confirms that traffic isn’t a cumulative phenomenon but a cyclic one.

The poly(hour, 4) coefficients offer a more realistic scenario:

  • poly(hour, 4)2 has the largest magnitude, -2.240e+04, indicating an inverted “U” that probably represents the morning ramp-up of traffic compared with the quiet night
  • the degree-3 and degree-4 terms, being significant (looking at the p-values), suggest we’re not doing enough: we need more bends to express the differences between day and night

The error represented by the Residual Standard Error is still high: 105 is better than the previous 114, but not good enough. This clearly says that using a single mean for the whole of Milan is not a good idea. Maybe that’s because we’re averaging over all the gates?

Method 3: Polynomial Interaction Model

With the 3rd approach, we’ll combine two improvements:

  • Include the variable label, to stop treating all the gates as equally important
  • Turn the model into an interaction model; instead of assuming that the Monday curve is the same as the Sunday one, the hour predictor can interact with the day of the week to produce a better prediction. The EDA suggests exactly this: a weekday at 8:30 shows a different curve than a Sunday at the same hour.

# Change the contrast settings
options(contrasts = c("contr.sum", "contr.poly"))

# Model 
interact_model <- lm(total_transits ~ poly(hour, 4) * day_of_week + month + label, data = train_data)

summary(interact_model)
## 
## Call:
## lm(formula = total_transits ~ poly(hour, 4) * day_of_week + month + 
##     label, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.41  -35.89   -4.78   30.44  638.94 
## 
## Coefficients:
##                                Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                   1.048e+02  8.816e-01  118.925  < 2e-16 ***
## poly(hour, 4)1                7.193e+03  6.164e+01  116.694  < 2e-16 ***
## poly(hour, 4)2               -2.241e+04  6.164e+01 -363.599  < 2e-16 ***
## poly(hour, 4)3                1.091e+03  6.164e+01   17.696  < 2e-16 ***
## poly(hour, 4)4                1.033e+03  6.164e+01   16.759  < 2e-16 ***
## day_of_week.L                -1.211e+01  3.227e-01  -37.522  < 2e-16 ***
## day_of_week.Q                -2.151e+01  3.226e-01  -66.693  < 2e-16 ***
## day_of_week.C                -5.208e+00  3.217e-01  -16.190  < 2e-16 ***
## day_of_week^4                -2.830e+00  3.216e-01   -8.801  < 2e-16 ***
## day_of_week^5                 4.547e-01  3.209e-01    1.417 0.156469    
## day_of_week^6                 5.090e-02  3.215e-01    0.158 0.874191    
## month.L                       5.401e+00  4.840e+00    1.116 0.264418    
## month.Q                       2.124e+01  5.278e+00    4.024 5.72e-05 ***
## month.C                       2.367e+01  4.840e+00    4.891 1.00e-06 ***
## month^4                      -1.369e+01  3.888e+00   -3.520 0.000431 ***
## month^5                      -1.404e+01  2.773e+00   -5.064 4.10e-07 ***
## month^6                       4.205e+00  1.769e+00    2.376 0.017479 *  
## month^7                       2.687e+01  1.031e+00   26.064  < 2e-16 ***
## month^8                       1.030e+01  6.048e-01   17.033  < 2e-16 ***
## month^9                      -1.212e+01  4.422e-01  -27.403  < 2e-16 ***
## month^10                     -2.169e+01  4.066e-01  -53.359  < 2e-16 ***
## month^11                     -1.830e+01  4.027e-01  -45.447  < 2e-16 ***
## label1                        8.330e+01  7.576e-01  109.957  < 2e-16 ***
## label2                        1.695e+01  7.591e-01   22.323  < 2e-16 ***
## label3                       -8.056e+01  7.562e-01 -106.540  < 2e-16 ***
## label4                       -9.341e+01  7.587e-01 -123.126  < 2e-16 ***
## label5                       -1.831e+01  7.553e-01  -24.245  < 2e-16 ***
## label6                        7.280e+01  7.611e-01   95.641  < 2e-16 ***
## label7                       -2.474e+01  7.602e-01  -32.547  < 2e-16 ***
## label8                        1.169e+02  7.599e-01  153.790  < 2e-16 ***
## label9                       -2.880e+01  7.594e-01  -37.926  < 2e-16 ***
## label10                      -1.082e+01  7.608e-01  -14.222  < 2e-16 ***
## label11                      -3.857e+01  7.554e-01  -51.054  < 2e-16 ***
## label12                      -8.488e+01  7.599e-01 -111.698  < 2e-16 ***
## label13                       1.154e+02  7.606e-01  151.675  < 2e-16 ***
## label14                      -5.750e+01  7.598e-01  -75.676  < 2e-16 ***
## label15                      -8.705e+01  7.599e-01 -114.555  < 2e-16 ***
## label16                       7.932e+01  7.605e-01  104.287  < 2e-16 ***
## label17                      -3.486e+01  7.599e-01  -45.881  < 2e-16 ***
## label18                      -4.872e+01  7.579e-01  -64.284  < 2e-16 ***
## label19                       1.283e+02  7.629e-01  168.168  < 2e-16 ***
## label20                      -9.094e+01  7.619e-01 -119.354  < 2e-16 ***
## label21                      -9.441e+01  7.556e-01 -124.947  < 2e-16 ***
## label22                       1.064e+02  7.576e-01  140.397  < 2e-16 ***
## label23                      -5.102e+00  7.613e-01   -6.702 2.06e-11 ***
## label24                       9.384e+00  7.561e-01   12.411  < 2e-16 ***
## label25                      -6.812e+01  7.600e-01  -89.623  < 2e-16 ***
## label26                      -3.899e+01  7.577e-01  -51.463  < 2e-16 ***
## label27                       4.093e+01  7.574e-01   54.032  < 2e-16 ***
## label28                       5.119e+01  7.588e-01   67.463  < 2e-16 ***
## label29                      -8.052e+01  7.608e-01 -105.836  < 2e-16 ***
## label30                      -8.166e+01  7.560e-01 -108.013  < 2e-16 ***
## label31                       1.430e+02  7.631e-01  187.454  < 2e-16 ***
## label32                      -7.922e+01  7.569e-01 -104.664  < 2e-16 ***
## label33                      -9.332e+01  7.598e-01 -122.824  < 2e-16 ***
## label34                      -6.407e+00  7.570e-01   -8.463  < 2e-16 ***
## label35                      -6.971e+01  7.575e-01  -92.031  < 2e-16 ***
## label36                       2.473e+02  7.554e-01  327.393  < 2e-16 ***
## label37                       1.641e+02  7.592e-01  216.080  < 2e-16 ***
## label38                      -2.401e+01  7.596e-01  -31.609  < 2e-16 ***
## label39                       7.214e-01  7.556e-01    0.955 0.339748    
## poly(hour, 4)1:day_of_week.L  6.811e+03  1.634e+02   41.693  < 2e-16 ***
## poly(hour, 4)2:day_of_week.L  7.575e+03  1.633e+02   46.383  < 2e-16 ***
## poly(hour, 4)3:day_of_week.L -9.423e+03  1.633e+02  -57.690  < 2e-16 ***
## poly(hour, 4)4:day_of_week.L  3.626e+03  1.633e+02   22.197  < 2e-16 ***
## poly(hour, 4)1:day_of_week.Q -2.113e+02  1.634e+02   -1.293 0.196019    
## poly(hour, 4)2:day_of_week.Q  3.190e+03  1.635e+02   19.517  < 2e-16 ***
## poly(hour, 4)3:day_of_week.Q -6.861e+03  1.634e+02  -41.979  < 2e-16 ***
## poly(hour, 4)4:day_of_week.Q  1.978e+03  1.634e+02   12.100  < 2e-16 ***
## poly(hour, 4)1:day_of_week.C -1.229e+03  1.630e+02   -7.538 4.80e-14 ***
## poly(hour, 4)2:day_of_week.C -2.071e+03  1.630e+02  -12.704  < 2e-16 ***
## poly(hour, 4)3:day_of_week.C -2.842e+03  1.631e+02  -17.426  < 2e-16 ***
## poly(hour, 4)4:day_of_week.C -9.318e+00  1.630e+02   -0.057 0.954418    
## poly(hour, 4)1:day_of_week^4 -2.008e+03  1.630e+02  -12.318  < 2e-16 ***
## poly(hour, 4)2:day_of_week^4 -1.051e+03  1.629e+02   -6.453 1.10e-10 ***
## poly(hour, 4)3:day_of_week^4  1.773e+01  1.629e+02    0.109 0.913321    
## poly(hour, 4)4:day_of_week^4 -1.253e+03  1.630e+02   -7.685 1.54e-14 ***
## poly(hour, 4)1:day_of_week^5 -1.545e+03  1.627e+02   -9.498  < 2e-16 ***
## poly(hour, 4)2:day_of_week^5 -2.220e+02  1.628e+02   -1.364 0.172682    
## poly(hour, 4)3:day_of_week^5  1.910e+03  1.628e+02   11.734  < 2e-16 ***
## poly(hour, 4)4:day_of_week^5 -1.583e+03  1.627e+02   -9.728  < 2e-16 ***
## poly(hour, 4)1:day_of_week^6 -6.796e+01  1.630e+02   -0.417 0.676675    
## poly(hour, 4)2:day_of_week^6  2.875e+02  1.631e+02    1.763 0.077923 .  
## poly(hour, 4)3:day_of_week^6  9.015e+02  1.630e+02    5.530 3.20e-08 ***
## poly(hour, 4)4:day_of_week^6 -6.211e+02  1.630e+02   -3.810 0.000139 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61.62 on 257156 degrees of freedom
## Multiple R-squared:  0.7208, Adjusted R-squared:  0.7207 
## F-statistic:  7902 on 84 and 257156 DF,  p-value: < 2.2e-16

Brief comment

This approach outperforms all the previous ones. The synergy between time and space is the key: the Residual Standard Error has dropped to 61; not a perfect result, but a significant improvement. The gate effects are large: label36 alone has an estimate of 247.3. Introducing the interaction has brought interesting results, visible in the poly(hour, 4)1:day_of_week terms. We can now say that the shape of the day changes across the week and, looking at the p-values, that the peaks are neither standard nor fixed.

Overall, the adjusted R-squared of 72% is a great result, but the residuals still show a maximum error of 638: Milan’s traffic is almost never the same.

Bonus Method: GAM

For a moment, we’ll step away from linear models in favor of a Generalized Additive Model. GAMs offer very flexible splines that may be the answer to this question. As a bonus analysis, we’ll predict the total transits with a spline while keeping the interaction, fitting a different curve for each day of the week.

Please note that instead of the default Gaussian family, we’ll use the Negative Binomial one. This choice was made because the Gaussian family “suffers” in the presence of very large outliers. Traffic data are counts and often show overdispersion, while the Gaussian model assumes constant variance. In real traffic data the variance tends to grow with the traffic: when there is little traffic, the variance is low; when there is a lot of traffic, the variance explodes. The Negative Binomial family should handle the spikes that the Gaussian tends to ignore or treat as errors.
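The mean–variance relationship behind this choice can be sketched numerically; the theta below is roughly the dispersion the fitted model will report, and the traffic levels are purely illustrative:

```r
# Negative Binomial: Var(Y) = mu + mu^2 / theta, so variance grows with the mean.
theta <- 5.34            # dispersion parameter, close to the fitted value below
mu    <- c(10, 100, 400) # a quiet, a normal and a rush hour (illustrative)
mu + mu^2 / theta        # NB variance at each level, vs. constant Gaussian variance
```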

gam_model <- gam(total_transits ~ s(hour, by = day_of_week, k = 20) + 
             day_of_week + month + label, 
             data = train_data, 
             family = nb(),
             method = "REML")

summary(gam_model)
## 
## Family: Negative Binomial(5.342) 
## Link function: log 
## 
## Formula:
## total_transits ~ s(hour, by = day_of_week, k = 20) + day_of_week + 
##     month + label
## 
## Parametric coefficients:
##                Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)    4.026840   0.007017  573.851  < 2e-16 ***
## day_of_week.L -0.209191   0.002453  -85.286  < 2e-16 ***
## day_of_week.Q -0.085270   0.002453  -34.769  < 2e-16 ***
## day_of_week.C -0.216520   0.002453  -88.285  < 2e-16 ***
## day_of_week^4  0.016604   0.002457    6.758 1.40e-11 ***
## day_of_week^5 -0.029881   0.002453  -12.180  < 2e-16 ***
## day_of_week^6  0.009069   0.002458    3.690 0.000224 ***
## month.L        0.055273   0.038540    1.434 0.151523    
## month.Q        0.278171   0.042030    6.618 3.63e-11 ***
## month.C        0.292335   0.038540    7.585 3.32e-14 ***
## month^4       -0.135920   0.030954   -4.391 1.13e-05 ***
## month^5       -0.149363   0.022068   -6.768 1.30e-11 ***
## month^6        0.037129   0.014055    2.642 0.008250 ** 
## month^7        0.300040   0.008152   36.807  < 2e-16 ***
## month^8        0.140069   0.004712   29.725  < 2e-16 ***
## month^9       -0.129219   0.003369  -38.359  < 2e-16 ***
## month^10      -0.279213   0.003099  -90.104  < 2e-16 ***
## month^11      -0.220104   0.003064  -71.835  < 2e-16 ***
## label1         0.996342   0.005459  182.499  < 2e-16 ***
## label2         0.512138   0.005539   92.452  < 2e-16 ***
## label3        -1.117396   0.006126 -182.408  < 2e-16 ***
## label4        -1.914765   0.006866 -278.894  < 2e-16 ***
## label5         0.279371   0.005559   50.258  < 2e-16 ***
## label6         0.868142   0.005500  157.838  < 2e-16 ***
## label7         0.087067   0.005637   15.445  < 2e-16 ***
## label8         1.194323   0.005454  218.973  < 2e-16 ***
## label9         0.042157   0.005646    7.466 8.25e-14 ***
## label10        0.313198   0.005591   56.017  < 2e-16 ***
## label11       -0.091554   0.005653  -16.195  < 2e-16 ***
## label12       -1.271815   0.006262 -203.093  < 2e-16 ***
## label13        1.229246   0.005455  225.363  < 2e-16 ***
## label14       -0.320342   0.005757  -55.641  < 2e-16 ***
## label15       -1.390716   0.006349 -219.056  < 2e-16 ***
## label16        0.997340   0.005480  182.005  < 2e-16 ***
## label17        0.104445   0.005629   18.553  < 2e-16 ***
## label18       -0.315724   0.005740  -55.004  < 2e-16 ***
## label19        1.191680   0.005475  217.656  < 2e-16 ***
## label20       -1.664789   0.006614 -251.715  < 2e-16 ***
## label21       -2.010901   0.006959 -288.948  < 2e-16 ***
## label22        1.173818   0.005438  215.846  < 2e-16 ***
## label23        0.308652   0.005595   55.168  < 2e-16 ***
## label24        0.443842   0.005533   80.217  < 2e-16 ***
## label25       -0.575083   0.005864  -98.076  < 2e-16 ***
## label26       -0.102328   0.005672  -18.040  < 2e-16 ***
## label27        0.750582   0.005488  136.767  < 2e-16 ***
## label28        0.782226   0.005494  142.369  < 2e-16 ***
## label29       -0.971712   0.006067 -160.162  < 2e-16 ***
## label30       -1.096469   0.006108 -179.505  < 2e-16 ***
## label31        1.310715   0.005466  239.779  < 2e-16 ***
## label32       -1.062379   0.006089 -174.475  < 2e-16 ***
## label33       -1.886142   0.006836 -275.920  < 2e-16 ***
## label34        0.274704   0.005572   49.297  < 2e-16 ***
## label35       -0.725008   0.005908 -122.716  < 2e-16 ***
## label36        1.649808   0.005385  306.388  < 2e-16 ***
## label37        1.391813   0.005431  256.265  < 2e-16 ***
## label38        0.054974   0.005644    9.741  < 2e-16 ***
## label39        0.483777   0.005519   87.662  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                          edf Ref.df Chi.sq p-value    
## s(hour):day_of_weekTue 18.79  18.99 106720  <2e-16 ***
## s(hour):day_of_weekWed 18.81  18.99 104138  <2e-16 ***
## s(hour):day_of_weekThu 18.78  18.99  91740  <2e-16 ***
## s(hour):day_of_weekFri 18.74  18.99  77058  <2e-16 ***
## s(hour):day_of_weekSat 18.62  18.98  58742  <2e-16 ***
## s(hour):day_of_weekSun 18.72  18.99  63191  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.843   Deviance explained = 85.2%
## -REML = 1.1876e+06  Scale est. = 1         n = 257241

Brief comment

Using a spline like s(hour) has increased the model quality again, bringing the adjusted R-squared up to 84%. The explained deviance also confirms that the GAM approach is the best one, as the model captures almost the whole phenomenon, probably also thanks to the generous k = 20, which leaves each smooth up to about 19 effective degrees of freedom.
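When fixing k by hand, it is good practice to verify that the basis dimension was adequate; assuming the fitted gam_model from above is still in memory, mgcv’s built-in diagnostic can do this:

```r
# gam.check() prints, for each smooth, k' (the maximum basis size), the
# effective degrees of freedom and a k-index; an edf close to k' or a low
# k-index would suggest that even k = 20 is too small.
library(mgcv)
gam.check(gam_model)
```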

Models verification

So far, we evaluated the metrics on the training dataset. To verify the models’ quality, it is necessary to test them on the dedicated test dataset. To keep it simple, we’ll run the checks only on the two best models: the interaction and GAM ones.

A “real_profile” will be added to visualize how the model predictions differ from the hourly averages observed in the dataset.

# Real profile
df_hourly <- readRDS("Cache/traffic_hourly.rds")

real_profile <- df_hourly %>%
  group_by(hour) %>%
  summarise(real_mean = mean(total_transits, na.rm = TRUE), .groups = "drop")

rm(df_hourly)


# Prediction grid
pred_grid <- test_data %>%
  select(hour, day_of_week, month, label) %>%
  distinct()

pred_grid$pred_interact <- predict(interact_model, newdata = pred_grid)
pred_grid$pred_gam <- predict(gam_model, newdata = pred_grid, type = "response")


# Prediction aggregation by hour
model_profiles <- pred_grid %>%
  group_by(hour) %>%
  summarise(IterativePoly = mean(pred_interact, na.rm = TRUE),
            GAM = mean(pred_gam, na.rm = TRUE),
            .groups = "drop")

# Plot
plot_data <- model_profiles %>%
  inner_join(real_profile, by = "hour") %>%
  rename(Mean2024 = real_mean) %>%
  pivot_longer(cols = -hour, names_to = "Source", values_to = "Transits")

p_final <- ggplot(plot_data, aes(x = hour, y = Transits, color = Source, linetype = Source)) +
                  geom_line(size = 1) +
                  geom_vline(xintercept = c(7.5, 19.5), linetype = "dashed", color = "#525252", alpha = 0.8) +
                  scale_color_manual(values = c("Mean2024" = "#252525",
                                                "IterativePoly" = "#d7191c",
                                                "GAM" = "#2c7bb6")) +
                  scale_linetype_manual(values = c("Mean2024" = "dotted", 
                                                   "IterativePoly" = "solid", 
                                                   "GAM" = "solid")) +
                  scale_x_continuous(breaks = 0:23) +
                  labs(title = "Model Comparison",
                       x = "Hour", 
                       y = "Hourly mean transits (per gate)", 
                       color = "Legenda", 
                       linetype = "Legenda") +
                  theme_minimal()
ggplotly(p_final)

The GAM model with the Negative Binomial family has definitely won, even if it tends to overestimate at night and underestimate during the day. The interaction polynomial (IterativePoly) doesn’t bend enough to reproduce the familiar “M” shape of the traffic; adding more interactions might improve the results, but we’ll stick with GAMs.

Just to be sure from an empirical point of view as well, let’s compute the Root Mean Squared Error on the test set to observe the typical size of the prediction errors.

# Iterative Polynomial
preds_interact <- predict(interact_model, newdata = test_data)
rmse_interact <- sqrt(mean((test_data$total_transits - preds_interact)^2))

# GAM
preds_gam <- predict(gam_model, newdata = test_data, type = "response")
rmse_gam <- sqrt(mean((test_data$total_transits - preds_gam)^2))

# Print
cat("RMSE Iterative Polynomial:", round(rmse_interact, 2), "\n")
## RMSE Iterative Polynomial: 61.68
cat("RMSE GAM:", round(rmse_gam, 2), "\n")
## RMSE GAM: 46.17

We have the confirmation: the GAM model performs better, missing by about 46 vehicles on average against 62 for the interaction polynomial. It’s still not perfect but, considering the non-linear behavior of traffic, it’s a good result.

Answering the question

The increase in traffic within Milan’s Area C appears to be attributable not to a single factor, but rather to a combination of social habits, environmental policies, and external variables.

The main driver of congestion is undoubtedly the urban work routine: data show traffic peaks around the start and end times of work, while volume drops dramatically on weekends. This trend is also confirmed on a seasonal scale, with sharp reductions in traffic during holiday periods such as August and December.

Regarding the composition of the vehicle fleet, the analysis highlights how the spread of hybrid vehicles is keeping volumes high. While regulations discourage the most polluting vehicles, facilitated or free access for hybrid cars appears to offset that decline, preventing an actual reduction in the total number of vehicles on the road.

Commercial traffic maintains a constant and less elastic presence throughout the day, being less affected by time-of-day variations than private cars.